This study uses two datasets, one on red wine and one on white wine.
Total number of observations: 6497.
The two datasets are related to red and white variants of Portuguese wines. Due to privacy and other issues, only the physicochemical (input) and sensory (output) variables are available. As a result, there is no data on grape types, wine brand, wine selling price, etc.
Both datasets have the same 12 attributes: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality.
These datasets can be treated as classification or regression tasks. The classes are ordered and not balanced (for example, there are more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. In addition, we are not sure whether all input variables are relevant, so it may be interesting to test feature selection methods.
The data were downloaded from the UCI Machine Learning Repository.
Cortez et al., 2009. Modeling wine preferences by data mining from physicochemical properties.
Setting the default path: to use this code on your computer, change the string wine.path to the folder where you downloaded the project files. The locations of the data and of the other R code will adjust automatically.
# ******************************************************************************
# #### SETUP ####
# ******************************************************************************
## work directory path ##
wine.path = "/Users/fernandoperes/dev/r/r-wine/" # to be reused as needed
setwd(wine.path)
## Sources
source(file = paste(wine.path, "wines-utils.R", sep = ""))
## Libraries
library(dplyr)
library(ggplot2)
getwd()
## [1] "/Users/fernandoperes/dev/r/r-wine"
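As a side note to the setup above, the same path handling can be sketched with file.path(), which inserts the separator itself so the root does not need a trailing slash. The root folder below is a hypothetical placeholder, not the author's path.

```r
# Hypothetical project root (assumption); replace with your own folder.
project.root <- "/path/to/r-wine"

# file.path() joins the components with "/" as separator.
utils.file <- file.path(project.root, "wines-utils.R")
data.file  <- file.path(project.root, "all-wine.Rda")

print(utils.file)
print(data.file)
```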
# ******************************************************************************
# #### Load prepared data files ####
# ******************************************************************************
load(file="all-wine.Rda")
# load(file="red-wine.Rda")
# load(file="white-wine.Rda")
Temporary variables that make it easy to reuse this code for other attributes: change only these values and the calculations adjust to the current attribute. Goal: avoid rework.
#### Initialization to simplify each block
## initialization (decrease rework)
x = all.wine$fixed.acidity
field.label = wine.fields.fixed.acidity
field.name = "fixed.acidity"
Note: wine.sno.lno() is defined in “wines-utils.R”.
### compute the lower and upper limits of the attribute
sno.lno = wine.sno.lno(x)
sno.lno
## 25% 75%
## 4.45 9.65
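The limits printed above come from wine.sno.lno(), which applies the 1.5 × IQR rule and then clamps each limit to the observed range. A minimal sketch of that rule on toy data (the vector below is illustrative and not taken from the wine datasets):

```r
# Toy data (assumption, for illustration only)
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

q1  <- unname(quantile(x, 0.25))  # first quartile
q3  <- unname(quantile(x, 0.75))  # third quartile
iqr <- q3 - q1                    # interquartile range

# 1.5 * IQR fences, clamped to the observed minimum and maximum
lower <- max(q1 - 1.5 * iqr, min(x))
upper <- min(q3 + 1.5 * iqr, max(x))

print(c(lower, upper))            # 2 14
print(x[x < lower | x > upper])   # 30
```

Here the lower fence (-2) falls below the smallest value, so it is replaced by the minimum, while the upper fence (14) is kept because the data extend beyond it.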
Mark the rows that are outliers according to the limits calculated above. Note: wine.mark.outlier() is defined in “wines-utils.R”.
# mark outliers
all.wine$outlier <- wine.mark.outlier(start = TRUE, df = all.wine,
field = field.name, sno.lno = sno.lno)
table(all.wine$outlier) # after marked outliers
##
## FALSE TRUE
## 6140 357
Note: TRUE = outlier
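The marking step can be sketched on a small data frame using only base R. The rows below are invented for illustration; the limits are the ones reported above for fixed.acidity. This is a simplified stand-in for wine.mark.outlier(), not a copy of it.

```r
# Toy rows (assumption); limits taken from the output above
toy    <- data.frame(fixed.acidity = c(3.9, 5.0, 7.2, 8.1, 9.65, 12.3))
limits <- c(4.45, 9.65)

# A row is an outlier when its value lies strictly outside the limits
toy$outlier <- toy$fixed.acidity < limits[1] | toy$fixed.acidity > limits[2]
print(table(toy$outlier))

# Keep only the rows that were not marked
toy.non.outliers <- toy[!toy$outlier, , drop = FALSE]
print(nrow(toy.non.outliers))  # 4
```

A value exactly equal to a limit (9.65 here) is not marked, because both comparisons are strict.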
# Get the subset of ALL.WINE excluding marked outliers
all.wine.non.outliers <- all.wine %>% filter(all.wine$outlier == FALSE)
x2 <- all.wine.non.outliers$fixed.acidity
# Plot field distribution with ALL (including outliers)
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers presentes", sep = "")
wine.distribution.plot(title = title, x = x, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red")
# Plot field distribution excluding outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers removidos", sep = "")
wine.distribution.plot(title = title, x = x2, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red" )
## plot the difference: all X vs. without outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - status de outliers", sep = "")
wine.outliers.summary.plot(title = title, t = table(all.wine$outlier),
colors = c("green", "dark red"))
table(all.wine$outlier)
##
## FALSE TRUE
## 6140 357
Note: TRUE = outlier
Temporary variables that make it easy to reuse this code for other attributes: change only these values and the calculations adjust to the current attribute. Goal: avoid rework.
#### Initialization to simplify each block
## initialization (decrease rework)
x = all.wine$volatile.acidity
field.label = wine.fields.volatile.acidity
field.name = "volatile.acidity"
Note: wine.sno.lno() is defined in “wines-utils.R”.
### compute the lower and upper limits of the attribute
sno.lno = wine.sno.lno(x)
sno.lno
## 75%
## 0.080 0.655
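In the output above only the second limit carries a label (75%). A likely explanation, given the wine.sno.lno() listing at the end of this document, is that the lower limit was clamped to min(x), which returns an unnamed value, while the upper limit kept the name inherited from quantile(). A minimal sketch on toy data:

```r
x <- c(1, 2, 3, 4, 50)            # toy data (assumption)

q1 <- quantile(x, probs = 0.25)   # named "25%"
q3 <- quantile(x, probs = 0.75)   # named "75%"

sno <- q1 - 1.5 * IQR(x)          # still named "25%"
if (sno < min(x)) sno <- min(x)   # clamped: the name is lost
lno <- q3 + 1.5 * IQR(x)          # named "75%"
if (lno > max(x)) lno <- max(x)   # not clamped here

limits <- c(sno, lno)
print(names(limits))              # "" "75%"
```

The labels are therefore leftovers from the quantile calculation; the values themselves are the lower and upper limits, not quartiles.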
Mark the rows that are outliers according to the limits calculated above. Note: wine.mark.outlier() is defined in “wines-utils.R”.
# mark outliers
all.wine$outlier <- wine.mark.outlier(start = TRUE, df = all.wine,
field = field.name, sno.lno = sno.lno)
table(all.wine$outlier) # after marked outliers
##
## FALSE TRUE
## 6120 377
Note: TRUE = outlier
# Get the subset of ALL.WINE excluding marked outliers
all.wine.non.outliers <- all.wine %>% filter(all.wine$outlier == FALSE)
x2 <- all.wine.non.outliers$volatile.acidity
# Plot field distribution with ALL (including outliers)
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers presentes", sep = "")
wine.distribution.plot(title = title, x = x, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red")
# Plot field distribution excluding outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers removidos", sep = "")
wine.distribution.plot(title = title, x = x2, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red" )
## plot the difference: all X vs. without outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - status de outliers", sep = "")
wine.outliers.summary.plot(title = title, t = table(all.wine$outlier),
colors = c("green", "dark red"))
table(all.wine$outlier)
##
## FALSE TRUE
## 6120 377
Note: TRUE = outlier
Temporary variables that make it easy to reuse this code for other attributes: change only these values and the calculations adjust to the current attribute. Goal: avoid rework.
#### Initialization to simplify each block
## initialization (decrease rework)
x = all.wine$citric.acid
field.label = wine.fields.citric.acid
field.name = "citric.acid"
Note: wine.sno.lno() is defined in “wines-utils.R”.
### compute the lower and upper limits of the attribute
sno.lno = wine.sno.lno(x)
sno.lno
## 25% 75%
## 0.04 0.60
Mark the rows that are outliers according to the limits calculated above. Note: wine.mark.outlier() is defined in “wines-utils.R”.
# mark outliers
all.wine$outlier <- wine.mark.outlier(start = TRUE, df = all.wine,
field = field.name, sno.lno = sno.lno)
table(all.wine$outlier) # after marked outliers
##
## FALSE TRUE
## 5988 509
Note: TRUE = outlier
# Get the subset of ALL.WINE excluding marked outliers
all.wine.non.outliers <- all.wine %>% filter(all.wine$outlier == FALSE)
x2 <- all.wine.non.outliers$citric.acid
# Plot field distribution with ALL (including outliers)
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers presentes", sep = "")
wine.distribution.plot(title = title, x = x, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red")
# Plot field distribution excluding outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers removidos", sep = "")
wine.distribution.plot(title = title, x = x2, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red" )
## plot the difference: all X vs. without outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - status de outliers", sep = "")
wine.outliers.summary.plot(title = title, t = table(all.wine$outlier),
colors = c("green", "dark red"))
table(all.wine$outlier)
##
## FALSE TRUE
## 5988 509
Note: TRUE = outlier
Temporary variables that make it easy to reuse this code for other attributes: change only these values and the calculations adjust to the current attribute. Goal: avoid rework.
#### Initialization to simplify each block
## initialization (decrease rework)
x = all.wine$residual.sugar
field.label = wine.fields.residual.sugar
field.name = "residual.sugar"
Note: wine.sno.lno() is defined in “wines-utils.R”.
### compute the lower and upper limits of the attribute
sno.lno = wine.sno.lno(x)
sno.lno
## 75%
## 0.60 17.55
Mark the rows that are outliers according to the limits calculated above. Note: wine.mark.outlier() is defined in “wines-utils.R”.
# mark outliers
all.wine$outlier <- wine.mark.outlier(start = TRUE, df = all.wine,
field = field.name, sno.lno = sno.lno)
table(all.wine$outlier) # after marked outliers
##
## FALSE TRUE
## 6379 118
Note: TRUE = outlier
# Get the subset of ALL.WINE excluding marked outliers
all.wine.non.outliers <- all.wine %>% filter(all.wine$outlier == FALSE)
x2 <- all.wine.non.outliers$residual.sugar
# Plot field distribution with ALL (including outliers)
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers presentes", sep = "")
wine.distribution.plot(title = title, x = x, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red")
# Plot field distribution excluding outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers removidos", sep = "")
wine.distribution.plot(title = title, x = x2, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red" )
## plot the difference: all X vs. without outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - status de outliers", sep = "")
wine.outliers.summary.plot(title = title, t = table(all.wine$outlier),
colors = c("green", "dark red"))
table(all.wine$outlier)
##
## FALSE TRUE
## 6379 118
Note: TRUE = outlier
Temporary variables that make it easy to reuse this code for other attributes: change only these values and the calculations adjust to the current attribute. Goal: avoid rework.
#### Initialization to simplify each block
## initialization (decrease rework)
x = all.wine$chlorides
field.label = wine.fields.chlorides
field.name = "chlorides"
Note: wine.sno.lno() is defined in “wines-utils.R”.
### compute the lower and upper limits of the attribute
sno.lno = wine.sno.lno(x)
sno.lno
## 75%
## 0.0090 0.1055
Mark the rows that are outliers according to the limits calculated above. Note: wine.mark.outlier() is defined in “wines-utils.R”.
# mark outliers
all.wine$outlier <- wine.mark.outlier(start = TRUE, df = all.wine,
field = field.name, sno.lno = sno.lno)
table(all.wine$outlier) # after marked outliers
##
## FALSE TRUE
## 6211 286
Note: TRUE = outlier
# Get the subset of ALL.WINE excluding marked outliers
all.wine.non.outliers <- all.wine %>% filter(all.wine$outlier == FALSE)
x2 <- all.wine.non.outliers$chlorides
# Plot field distribution with ALL (including outliers)
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers presentes", sep = "")
wine.distribution.plot(title = title, x = x, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red")
# Plot field distribution excluding outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers removidos", sep = "")
wine.distribution.plot(title = title, x = x2, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red" )
## plot the difference: all X vs. without outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - status de outliers", sep = "")
wine.outliers.summary.plot(title = title, t = table(all.wine$outlier),
colors = c("green", "dark red"))
table(all.wine$outlier)
##
## FALSE TRUE
## 6211 286
Note: TRUE = outlier
Temporary variables that make it easy to reuse this code for other attributes: change only these values and the calculations adjust to the current attribute. Goal: avoid rework.
#### Initialization to simplify each block
## initialization (decrease rework)
x = all.wine$free.sulfur.dioxide
field.label = wine.fields.free.sulfur.dioxide
field.name = "free.sulfur.dioxide"
Note: wine.sno.lno() is defined in “wines-utils.R”.
### compute the lower and upper limits of the attribute
sno.lno = wine.sno.lno(x)
sno.lno
## 75%
## 1 77
Mark the rows that are outliers according to the limits calculated above. Note: wine.mark.outlier() is defined in “wines-utils.R”.
# mark outliers
all.wine$outlier <- wine.mark.outlier(start = TRUE, df = all.wine,
field = field.name, sno.lno = sno.lno)
table(all.wine$outlier) # after marked outliers
##
## FALSE TRUE
## 6435 62
Note: TRUE = outlier
# Get the subset of ALL.WINE excluding marked outliers
all.wine.non.outliers <- all.wine %>% filter(all.wine$outlier == FALSE)
x2 <- all.wine.non.outliers$free.sulfur.dioxide
# Plot field distribution with ALL (including outliers)
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers presentes", sep = "")
wine.distribution.plot(title = title, x = x, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red")
# Plot field distribution excluding outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers removidos", sep = "")
wine.distribution.plot(title = title, x = x2, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red" )
## plot the difference: all X vs. without outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - status de outliers", sep = "")
wine.outliers.summary.plot(title = title, t = table(all.wine$outlier),
colors = c("green", "dark red"))
table(all.wine$outlier)
##
## FALSE TRUE
## 6435 62
Note: TRUE = outlier
Temporary variables that make it easy to reuse this code for other attributes: change only these values and the calculations adjust to the current attribute. Goal: avoid rework.
#### Initialization to simplify each block
## initialization (decrease rework)
x = all.wine$total.sulfur.dioxide
field.label = wine.fields.total.sulfur.dioxide
field.name = "total.sulfur.dioxide"
Note: wine.sno.lno() is defined in “wines-utils.R”.
### compute the lower and upper limits of the attribute
sno.lno = wine.sno.lno(x)
sno.lno
## 75%
## 6.0 274.5
Mark the rows that are outliers according to the limits calculated above. Note: wine.mark.outlier() is defined in “wines-utils.R”.
# mark outliers
all.wine$outlier <- wine.mark.outlier(start = TRUE, df = all.wine,
field = field.name, sno.lno = sno.lno)
table(all.wine$outlier) # after marked outliers
##
## FALSE TRUE
## 6487 10
Note: TRUE = outlier
# Get the subset of ALL.WINE excluding marked outliers
all.wine.non.outliers <- all.wine %>% filter(all.wine$outlier == FALSE)
x2 <- all.wine.non.outliers$total.sulfur.dioxide
# Plot field distribution with ALL (including outliers)
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers presentes", sep = "")
wine.distribution.plot(title = title, x = x, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red")
# Plot field distribution excluding outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers removidos", sep = "")
wine.distribution.plot(title = title, x = x2, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red" )
## plot the difference: all X vs. without outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - status de outliers", sep = "")
wine.outliers.summary.plot(title = title, t = table(all.wine$outlier),
colors = c("green", "dark red"))
table(all.wine$outlier)
##
## FALSE TRUE
## 6487 10
Note: TRUE = outlier
Temporary variables that make it easy to reuse this code for other attributes: change only these values and the calculations adjust to the current attribute. Goal: avoid rework.
#### Initialization to simplify each block
## initialization (decrease rework)
x = all.wine$density
field.label = wine.fields.density
field.name = "density"
Note: wine.sno.lno() is defined in “wines-utils.R”.
### compute the lower and upper limits of the attribute
sno.lno = wine.sno.lno(x)
sno.lno
## 75%
## 0.987110 1.003965
Mark the rows that are outliers according to the limits calculated above. Note: wine.mark.outlier() is defined in “wines-utils.R”.
# mark outliers
all.wine$outlier <- wine.mark.outlier(start = TRUE, df = all.wine,
field = field.name, sno.lno = sno.lno)
table(all.wine$outlier) # after marked outliers
##
## FALSE TRUE
## 6494 3
Note: TRUE = outlier
# Get the subset of ALL.WINE excluding marked outliers
all.wine.non.outliers <- all.wine %>% filter(all.wine$outlier == FALSE)
x2 <- all.wine.non.outliers$density
# Plot field distribution with ALL (including outliers)
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers presentes", sep = "")
wine.distribution.plot(title = title, x = x, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red")
# Plot field distribution excluding outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers removidos", sep = "")
wine.distribution.plot(title = title, x = x2, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red" )
## plot the difference: all X vs. without outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - status de outliers", sep = "")
wine.outliers.summary.plot(title = title, t = table(all.wine$outlier),
colors = c("green", "dark red"))
table(all.wine$outlier)
##
## FALSE TRUE
## 6494 3
Note: TRUE = outlier
Temporary variables that make it easy to reuse this code for other attributes: change only these values and the calculations adjust to the current attribute. Goal: avoid rework.
#### Initialization to simplify each block
## initialization (decrease rework)
x = all.wine$pH
field.label = wine.fields.pH
field.name = "pH"
Note: wine.sno.lno() is defined in “wines-utils.R”.
### compute the lower and upper limits of the attribute
sno.lno = wine.sno.lno(x)
sno.lno
## 25% 75%
## 2.795 3.635
Mark the rows that are outliers according to the limits calculated above. Note: wine.mark.outlier() is defined in “wines-utils.R”.
# mark outliers
all.wine$outlier <- wine.mark.outlier(start = TRUE, df = all.wine,
field = field.name, sno.lno = sno.lno)
table(all.wine$outlier) # after marked outliers
##
## FALSE TRUE
## 6424 73
Note: TRUE = outlier
# Get the subset of ALL.WINE excluding marked outliers
all.wine.non.outliers <- all.wine %>% filter(all.wine$outlier == FALSE)
x2 <- all.wine.non.outliers$pH
# Plot field distribution with ALL (including outliers)
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers presentes", sep = "")
wine.distribution.plot(title = title, x = x, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red")
# Plot field distribution excluding outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers removidos", sep = "")
wine.distribution.plot(title = title, x = x2, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red" )
## plot the difference: all X vs. without outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - status de outliers", sep = "")
wine.outliers.summary.plot(title = title, t = table(all.wine$outlier),
colors = c("green", "dark red"))
table(all.wine$outlier)
##
## FALSE TRUE
## 6424 73
Note: TRUE = outlier
Temporary variables that make it easy to reuse this code for other attributes: change only these values and the calculations adjust to the current attribute. Goal: avoid rework.
#### Initialization to simplify each block
## initialization (decrease rework)
x = all.wine$sulphates
field.label = wine.fields.sulphates
field.name = "sulphates"
Note: wine.sno.lno() is defined in “wines-utils.R”.
### compute the lower and upper limits of the attribute
sno.lno = wine.sno.lno(x)
sno.lno
## 75%
## 0.220 0.855
Mark the rows that are outliers according to the limits calculated above. Note: wine.mark.outlier() is defined in “wines-utils.R”.
# mark outliers
all.wine$outlier <- wine.mark.outlier(start = TRUE, df = all.wine,
field = field.name, sno.lno = sno.lno)
table(all.wine$outlier) # after marked outliers
##
## FALSE TRUE
## 6306 191
Note: TRUE = outlier
# Get the subset of ALL.WINE excluding marked outliers
all.wine.non.outliers <- all.wine %>% filter(all.wine$outlier == FALSE)
x2 <- all.wine.non.outliers$sulphates
# Plot field distribution with ALL (including outliers)
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers presentes", sep = "")
wine.distribution.plot(title = title, x = x, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red")
# Plot field distribution excluding outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers removidos", sep = "")
wine.distribution.plot(title = title, x = x2, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red" )
## plot the difference: all X vs. without outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - status de outliers", sep = "")
wine.outliers.summary.plot(title = title, t = table(all.wine$outlier),
colors = c("green", "dark red"))
table(all.wine$outlier)
##
## FALSE TRUE
## 6306 191
Note: TRUE = outlier
Temporary variables that make it easy to reuse this code for other attributes: change only these values and the calculations adjust to the current attribute. Goal: avoid rework.
#### Initialization to simplify each block
## initialization (decrease rework)
x = all.wine$alcohol
field.label = wine.fields.alcohol
field.name = "alcohol"
Note: wine.sno.lno() is defined in “wines-utils.R”.
### compute the lower and upper limits of the attribute
sno.lno = wine.sno.lno(x)
sno.lno
## 75%
## 8 14
Mark the rows that are outliers according to the limits calculated above. Note: wine.mark.outlier() is defined in “wines-utils.R”.
# mark outliers
all.wine$outlier <- wine.mark.outlier(start = TRUE, df = all.wine,
field = field.name, sno.lno = sno.lno)
table(all.wine$outlier) # after marked outliers
##
## FALSE TRUE
## 6494 3
Note: TRUE = outlier
# Get the subset of ALL.WINE excluding marked outliers
all.wine.non.outliers <- all.wine %>% filter(all.wine$outlier == FALSE)
x2 <- all.wine.non.outliers$alcohol
# Plot field distribution with ALL (including outliers)
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers presentes", sep = "")
wine.distribution.plot(title = title, x = x, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red")
# Plot field distribution excluding outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers removidos", sep = "")
wine.distribution.plot(title = title, x = x2, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red" )
## plot the difference: all X vs. without outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - status de outliers", sep = "")
wine.outliers.summary.plot(title = title, t = table(all.wine$outlier),
colors = c("green", "dark red"))
table(all.wine$outlier)
##
## FALSE TRUE
## 6494 3
Note: TRUE = outlier
Temporary variables that make it easy to reuse this code for other attributes: change only these values and the calculations adjust to the current attribute. Goal: avoid rework.
#### Initialization to simplify each block
## initialization (decrease rework)
x = all.wine$quality
field.label = wine.fields.quality
field.name = "quality"
Note: wine.sno.lno() is defined in “wines-utils.R”.
### compute the lower and upper limits of the attribute
sno.lno = wine.sno.lno(x)
sno.lno
## 25% 75%
## 3.5 7.5
Mark the rows that are outliers according to the limits calculated above. Note: wine.mark.outlier() is defined in “wines-utils.R”.
# mark outliers
all.wine$outlier <- wine.mark.outlier(start = TRUE, df = all.wine,
field = field.name, sno.lno = sno.lno)
table(all.wine$outlier) # after marked outliers
##
## FALSE TRUE
## 6269 228
Note: TRUE = outlier
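Because quality is an integer score, the limits 3.5 and 7.5 reported above translate directly into which scores get marked. A small sketch, assuming scores range from 0 to 10 (the scale is an assumption, it is not stated in this document):

```r
limits <- c(3.5, 7.5)   # limits reported above for quality
scores <- 0:10          # assumed range of possible scores

# Scores strictly outside the limits are the ones marked as outliers
flagged <- scores[scores < limits[1] | scores > limits[2]]
print(flagged)          # 0 1 2 3 8 9 10
```

In other words, for this attribute the outliers are the poorest and the best rated wines, which is the group the introduction suggests outlier detection could help find.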
# Get the subset of ALL.WINE excluding marked outliers
all.wine.non.outliers <- all.wine %>% filter(all.wine$outlier == FALSE)
x2 <- all.wine.non.outliers$quality
# Plot field distribution with ALL (including outliers)
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers presentes", sep = "")
wine.distribution.plot(title = title, x = x, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red")
# Plot field distribution excluding outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - com outliers removidos", sep = "")
wine.distribution.plot(title = title, x = x2, field.label = field.label,
color = wine.color.all, xlim = c(min(x), max(x)),
sno.lno = sno.lno, line.color = "red" )
## plot the difference: all X vs. without outliers
par.customized <- par(mfrow = c(1, 2))
title = paste(field.label, " - status de outliers", sep = "")
wine.outliers.summary.plot(title = title, t = table(all.wine$outlier),
colors = c("green", "dark red"))
table(all.wine$outlier)
##
## FALSE TRUE
## 6269 228
Note: TRUE = outlier
Outlier marking and filtering will be run in a separate file. This document focuses on outlier analysis, and the outlier set was reset for the analysis of each field. For filtering and for generating the dataset without outliers, the marks will be accumulated only over the fields that will be filtered.
Below are the source code excerpts from “wines-utils.R” that were used.
For more information, see the file “wines-utils.R” in the project root.
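The accumulation step described above could look roughly like the following sketch. The data frame is invented for illustration, the choice of pH and alcohol as filtered fields is an assumption, and the limits are the ones reported earlier in this document for those two fields.

```r
# Toy rows (assumption); limits copied from the pH and alcohol outputs above
toy <- data.frame(
  pH      = c(2.7, 3.1, 3.3, 3.2, 3.9),
  alcohol = c(9.5, 10.2, 14.9, 11.0, 12.1)
)
limits <- list(pH = c(2.795, 3.635), alcohol = c(8, 14))

toy$outlier <- FALSE                      # reset once, before the first field
for (field in names(limits)) {
  lim <- limits[[field]]
  out.of.range <- toy[[field]] < lim[1] | toy[[field]] > lim[2]
  toy$outlier <- toy$outlier | out.of.range   # accumulate, do not reset
}

toy.clean <- toy[!toy$outlier, ]          # rows with no mark in any field
print(toy.clean)
```

A row is dropped as soon as one of the filtered fields falls outside its limits, so the number of removed rows can be smaller than the sum of the per-field outlier counts.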
#### Fields label constants ####
if (wine.language == "PT-BR"){
wine.fields.fixed.acidity = "Acidez fixa"
wine.fields.volatile.acidity = "Acidez volatil"
wine.fields.citric.acid = "Acido citrico"
wine.fields.residual.sugar = "Acucar residual"
wine.fields.chlorides = "Cloretos"
wine.fields.free.sulfur.dioxide = "Livre de dioxido de enxofre"
wine.fields.total.sulfur.dioxide = "Total de dioxido de enxofre"
wine.fields.density = "Densidade"
wine.fields.pH = "pH"
wine.fields.sulphates = "Sulfatos"
wine.fields.alcohol = "Teor alcolico"
wine.fields.quality = "Qualidade"
wine.fields.color = "Cor"
wine.fields.col.color = "Cor para cor do vinho"
wine.fields.taste = "Gosto (Conceito)"
} else { # if the language is not configured, the default is ENGLISH
wine.fields.fixed.acidity = "fixed.acidity"
wine.fields.volatile.acidity = "volatile.acidity"
wine.fields.citric.acid = "citric.acid"
wine.fields.residual.sugar = "residual.sugar"
wine.fields.chlorides = "chlorides"
wine.fields.free.sulfur.dioxide = "free.sulfur.dioxide"
wine.fields.total.sulfur.dioxide = "total.sulfur.dioxide"
wine.fields.density = "density"
wine.fields.pH = "pH"
wine.fields.sulphates = "sulphates"
wine.fields.alcohol = "alcohol"
wine.fields.quality = "quality"
wine.fields.color = "color"
wine.fields.col.color = "col.color"
wine.fields.taste = "taste"
}
#*******************************************************************************
# #### Plot the field distribution ####
# Boxplot and a histogram to present the distribution of the current field
#*******************************************************************************
wine.distribution.plot <- function(title, x, field.label, color, xlim, sno.lno, line.color) {
wine.boxplot(title = title,x = x, xlab = field.label, color = color,
xlim = xlim, sno.lno = sno.lno, line.color = line.color)
wine.histogram(title = title, x = x, xlab = field.label, color = color,
xlim = xlim, sno.lno = sno.lno, line.color = line.color)
}
#*******************************************************************************
# #### Wine boxplot ####
# Wine Boxplot specially prepared to present the distribution of the current
# field
#*******************************************************************************
wine.boxplot <- function(title, x, xlab, color, xlim, sno.lno, line.color){
## Box plot
boxplot(x = x, xlab = xlab,
main = title,
col = color,
ylim = xlim, # main = main0,
cex.axis = 0.75, cex.lab = 0.75, cex.main = 0.85,
horizontal = T,
frame = F)
## Lines
# smallest non-outlier
abline(v = sno.lno[1],
lwd = 2,
col = line.color)
# median or q2
abline(v = median(x),
lwd = 2,
col = line.color)
# largest non-outlier
abline(v = sno.lno[2],
lwd = 2,
col = line.color)
}
#*******************************************************************************
# #### wine histogram ####
# Wine histogram specially prepared to present the distribution of the current
# field.
#
# Note: sno.lno = smallest non-outlier . largest non-outlier
#*******************************************************************************
wine.histogram <- function(title, x, xlab, color, xlim, sno.lno, line.color){
# get highest count of hist breaks (to avoid cutting off labels when plotted)
yhist <- hist(x, plot = FALSE)
highestCount <- max(yhist$counts) * 1.1
# histogram
h <- hist(x = x, xlab = xlab,
main = title,
ylab = "Frequencia",
col = color, xlim = xlim, ylim = c(0, highestCount * 1.1),
cex.main = 0.85, adj = 0, include.lowest = TRUE, cex.axis = 0.75,
cex.lab = 0.75, labels = TRUE)
xfit <- seq(min(x), max(x), length = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "blue", lwd = 2)
## Lines
# smallest non-outlier
abline(v = sno.lno[1],
lwd = 2,
col = line.color)
# median or q2
abline(v = median(x),
lwd = 2,
col = line.color)
# largest non-outlier
abline(v = sno.lno[2],
lwd = 2,
col = line.color)
}
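# The blue curve in wine.histogram() is a normal density rescaled to the count
# axis: density * bin width * number of observations. The sketch below checks
# that scaling on simulated data (an assumption, not the wine data).

```r
set.seed(1)
x <- rnorm(1000, mean = 10, sd = 2)    # simulated toy data

h <- hist(x, plot = FALSE)             # bins only, no drawing
bin.width <- diff(h$mids[1:2])         # equal-width bins assumed

# Expected count per bin under a normal curve with the sample mean and sd
expected <- dnorm(h$mids, mean = mean(x), sd = sd(x)) * bin.width * length(x)

# The expected counts add up to roughly the number of observations
print(sum(expected))
print(sum(h$counts))
```

# Without the bin.width * length(x) factor the curve would stay on the density
# scale and look flat next to the bar heights.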
#*******************************************************************************
# #### wine sno.lno ####
# wine.sno.lno() computes the non-outlier limits
# - sno = smallest non-outlier
# - lno = largest non-outlier
#*******************************************************************************
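wine.sno.lno <- function(x) {
  q1 <- quantile(x, probs = 0.25, na.rm = TRUE)
  q3 <- quantile(x, probs = 0.75, na.rm = TRUE)
  iqr <- q3 - q1 # interquartile range, with the same NA handling as q1 and q3
  sno <- q1 - 1.5 * iqr # sno = smallest non-outlier
  # clamp the lower limit to the observed minimum
  if (sno < min(x, na.rm = TRUE)) sno <- min(x, na.rm = TRUE)
  lno <- q3 + 1.5 * iqr # lno = largest non-outlier
  # clamp the upper limit to the observed maximum
  if (lno > max(x, na.rm = TRUE)) lno <- max(x, na.rm = TRUE)
  c(sno, lno)
}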
#*******************************************************************************
# #### wine mark outliers ####
# according to the field limits (sno - lno), rows are tagged as outliers
# - sno = smallest non-outlier
# - lno = largest non-outlier
#*******************************************************************************
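wine.mark.outlier <- function(start, df, field, sno.lno){
  # start = TRUE on the first call or when the marks should be reset;
  # start = FALSE keeps the existing marks and continues tagging
  if (start) {
    marked <- rep(FALSE, nrow(df))
  } else {
    # earlier marks are stored as a factor, so convert them back to logical
    marked <- as.logical(as.character(df$outlier))
  }
  values <- df[[field]]
  # mark TRUE for rows outside the limits
  marked[which(values < sno.lno[1] | values > sno.lno[2])] <- TRUE
  # fixed levels keep table() stable even when no outlier is found
  factor(marked, levels = c(FALSE, TRUE))
}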
#*******************************************************************************
# #### wine outliers plot ####
#*******************************************************************************
wine.outliers.summary.plot <- function(title, t, colors){
# compute the percentages to be used in the labels
p = 100 * t / sum(t)
p = round(p, digits = 2)
# customized label
labs = c(paste("dados ", p[1], "%", sep = ""),
paste("outliers ", p[2], "%", sep = ""))
# pie chart
pie(x = p,
main = title,
labels = labs,
col = colors,
cex.axis = 0.75,
cex.lab = 0.75,
cex.main = 0.85)
# bar chart
bp <- barplot( t,
main = title,
col = colors,
cex.axis = 0.75,
cex.lab = 0.75,
cex.main = 0.85,
horiz = TRUE,
beside = TRUE)
}
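# The label calculation in wine.outliers.summary.plot() can be checked by hand
# with the counts reported earlier for fixed.acidity (6140 FALSE, 357 TRUE).
# The English label text below is an assumption; the function itself uses
# "dados" and "outliers".

```r
counts <- c("FALSE" = 6140, "TRUE" = 357)   # counts from the table output

# Percentages rounded to two decimals, as in the function
p <- round(100 * counts / sum(counts), digits = 2)

labs <- c(paste0("data ", p[1], "%"),
          paste0("outliers ", p[2], "%"))
print(labs)   # "data 94.51%" "outliers 5.49%"
```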